Abstract 

Background: Changes in the order of mitochondrial genes are a good source of information for phylogenetic 
investigations. Phylogenetic hypotheses are often supported by parsimonious mitochondrial gene order 
rearrangement scenarios. CREx is a heuristic for computing short pairwise rearrangement scenarios for metazoan 
mitochondrial gene orders. Different from other methods, CREx considers four types of rearrangement operations: 
inversions, transpositions, inverse transpositions, and tandem duplication random loss operations. 

Results: An extensive analysis of the CREx reconstructions for artificial data has been presented and it is shown 
how the quality of the reconstructed rearrangement scenarios depends on the type of rearrangement model and 
additional parameter values. Moreover, a fast method is proposed to apply CREx to a large number of gene orders 
to find likely rearrangement scenarios and store them in a graph structure called RI-Graph. This method is applied 
to analyse all known metazoan mitochondrial gene orders. It is shown that the obtained RI-Graph contains many 
rearrangement scenarios that are described in the literature. 

Conclusions: The prospects and limitations of CREx have been analysed empirically and a comparison with the 
literature on gene order evolution highlights its benefits. The newly developed method to apply CREx to a large 
number of gene orders is successful in computing an RI-graph that contains many rearrangement scenarios for 
metazoan gene orders that have also been described in the literature. This shows that the new method is very 
helpful for a fast analysis of a large number of gene orders which is relevant due to the strongly increasing 
number of known gene orders. 



Background 

Phylogenetic hypotheses are often supported by the 
computation of parsimonious scenarios of genome-wide 
rearrangement operations. Especially mitochondrial gene 
orders became a very fruitful source for such investiga- 
tions as they are known for more than 1700 metazoan 
species. Furthermore they exhibit a small and usually 
preserved gene set [1]. Therefore, we focus in this paper 
on the case of sets of gene orders that all have the same 
set of genes and do not have duplicated genes. It is 
assumed that several types of rearrangement operations 
have shaped the gene orders observed today. Inversions 
and transpositions are well documented [2]. Also inverse 
transpositions, i.e., transpositions where the transposed 
part is inverted, have been found several times [1], [3-5]. 
Recent studies add tandem duplication random loss 
(TDRL) to the set of rearrangement operations that are 
relevant for mitochondrial gene order evolution [6-10]. 
A TDRL consists of a tandem duplication, i.e., a duplica- 
tion of a continuous segment of genes such that the ori- 
ginal segment and its copy are consecutive, followed by 
the loss of one copy of each of the redundant genes. 
The reconstruction of the evolution of gene orders for 
only three given gene orders is in almost all studied 
cases an NP-hard problem [11,12], e.g., even when only 
inversions are considered. This complicates studies con- 
sidering combinations of different types of rearrange- 
ment operations in event based reconstruction methods, 
e.g., [13]. Inversions and transpositions are the most 
often considered genomic rearrangement operations for 
phylogenetic reconstruction, e.g., [14]. But ideally, all 
four relevant types of rearrangement operations should 
be regarded. 

It can be found that some gene clusters are preserved 
during gene order evolution - be it for functional rea- 
sons or just by chance. Hence, several studies have 
investigated gene order rearrangement problems under 
the constraint that gene clusters are preserved. Typically, 
gene clusters are defined by a formal model in these 
studies. For gene orders without deleted or duplicated 
elements gene clusters are most often defined as common 
intervals [15], i.e., sets of genes that occur contiguously 
in each of the considered gene orders. The strong 
interval tree (SIT) is a data structure to efficiently 
represent the set of all common intervals of two or 
more permutations. Based on SITs an efficient exact 
algorithm for computing parsimonious inversion only 
preserving rearrangement scenarios for two given gene 
orders is presented in [16]. An extension of the SIT 
data structure is proposed in [17] for computing short- 
est preserving inversion scenarios for more than two 
gene orders. An interesting approach to identify inver- 
sions and transpositions for pairwise gene order com- 
parisons is to search for certain templates within the 
SIT [18]. In [19] the algorithm CREx (Common Interval 
Rearrangement Explorer) is presented which heuristi- 
cally infers a preserving rearrangement scenario for two 
gene orders by considering all four mentioned kinds of 
rearrangement operations. One principle of CREx is to 
identify different patterns that correspond to the differ- 
ent types of rearrangement operations in the SIT data 
structure. Extensions to CREx and the TreeREx 
approach for automatically computing the rearrange- 
ments in a given phylogenetic tree are presented in 
[20]. In this paper we present an extensive study of CREx on 
simulated data. Even though CREx has already been applied 
successfully to biological data in several studies, e.g., [21], 
such a systematic study has been missing so far. We 
also propose a method for applying CREx to a large 
number of mitochondrial gene orders to identify likely 
rearrangement scenarios. The reconstructions that are 
obtained with the new method are stored in a so-called 
rearrangement inventory graph (RI-graph). The reconstructions 
in the RI-graph are evaluated by a comprehensive 
comparison with reconstructions published in the 
literature. 



Materials and methods 

Gene order comparison with CREx in a nutshell 

In the following we briefly introduce CREx and the SIT 
data structure. For more details see [19,20]. CREx compares 
two gene orders without duplicated or deleted 
genes which can be regarded as signed permutations, 
i.e., permutations with a sign (+/-) added to each element 
representing the orientation of the gene. A set of 
(unsigned) elements appearing consecutively in two 
gene orders is a common interval. Note that for the defi- 
nition of a common interval the orientation and order 
of its elements can differ in the two permutations. Two 
common intervals overlap if they have a nonempty 
intersection and none is included in the other. A com- 
mon interval is strong if it does not overlap any other 
common interval. The strong interval tree (SIT) is the 
graph where each node corresponds to a strong com- 
mon interval and is connected to the node representing 
the smallest superset of genes. A node is linear increas- 
ing (resp. decreasing) if the strong common intervals 
corresponding to its children are in the same (resp. 
reverse) order in the two compared permutations and 
prime otherwise. Each node of the SIT has a sign. The 
sign of a leaf node is given by the relative orientation of 
the corresponding element in the permutations. The 
sign of a linear node is + if it is linear increasing and - 
if it is linear decreasing. A prime node inherits the sign 
of a linear parent node and is + if no such node exists. 

Consider the signed permutation $\pi$ = (6 9 7 10 8 1 -4 
5 -3 -2) and the identity permutation. In addition to the 
common interval {1,..., 10} and the singletons the two 
permutations have the following seven common intervals: 
{2, 3}, {2, 3, 4, 5}, {1, 2, 3, 4, 5}, {3, 4, 5}, {4, 5}, 
{7, 8, 9, 10}, {6, 7, 8, 9, 10}. The common intervals {2, 3} and 
{3, 4, 5} are not strong because they overlap. The 
remaining common intervals are strong and define the 
structure of the SIT as shown in Figure 1. The node 
corresponding to the strong common interval {2, 3, 4, 5} 
is decreasing because the child strong common intervals 
occur in the opposite order in $\pi$, i.e., in $\pi$ they are in the 
order {4, 5}, {3}, {2} whereas they are in the order {2}, 
{3}, {4, 5} in the identity. 

Figure 1 Example SIT and CREx reconstruction. Left: SIT of the 
signed permutation $\pi$ = (6 9 7 10 8 1 -4 5 -3 -2) and the identity 
permutation; the node type is represented by its colour and shape; 
prime nodes are represented by a blue node with rounded corners; 
linear increasing (resp. decreasing) nodes are represented by red 
(resp. green) rectangles. Right: the corresponding rearrangement 
scenario derived by CREx. 

CREx is based on the observation 
that the application of a single rearrangement 
operation leads to a pattern in the SIT which is specific 
for the type of the rearrangement, e.g., a transposition 
leads to a linear node with two linear children of oppo- 
site sign and a TDRL always leads to a prime node 
(unless its effect can also be described by a transposi- 
tion). CREx reconstructs a short rearrangement scenario 
by identifying these patterns in the SIT in a specific 
order. Special care is taken for prime nodes with 
inverted elements since these cannot be explained by 
TDRLs only. CREx also includes the possibility for alter- 
native scenarios, e.g., three inversions as an alternative 
for a transposition or several possible scenarios for 
prime nodes. Algorithm CREx, a tutorial, and several 
detailed examples are available online. 
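
The common intervals listed for the example permutation can be checked with a small brute-force script. This is only an illustrative sketch, not the SIT algorithm used by CREx: the function names are hypothetical, the second gene order is assumed to be the identity, and singletons are left out.

```python
def common_intervals(pi):
    """Common intervals (size >= 2) of the signed permutation pi and the
    identity, returned as sorted tuples of unsigned elements.
    Brute force over all windows; orientation is ignored."""
    genes = [abs(g) for g in pi]
    found = set()
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            window = genes[i:j + 1]
            # A window of pi is also contiguous in the identity exactly
            # when its values form a gap-free range of integers.
            if max(window) - min(window) == j - i:
                found.add(tuple(sorted(window)))
    return found


def overlap(a, b):
    """Two intervals overlap if they intersect and neither contains the other."""
    a, b = set(a), set(b)
    return bool(a & b) and not a <= b and not b <= a


def strong_intervals(intervals):
    """Keep the intervals that overlap no other interval in the collection."""
    return {a for a in intervals if not any(overlap(a, b) for b in intervals)}


pi = (6, 9, 7, 10, 8, 1, -4, 5, -3, -2)
ci = common_intervals(pi)        # includes the full set {1,..., 10}
strong = strong_intervals(ci)    # {2, 3} and {3, 4, 5} drop out
```

The strong intervals that remain are the inner nodes of the SIT; the tree itself would follow from ordering them by inclusion.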

CREx represents rearrangement scenarios from $\pi$ to $\sigma$ 
in a tree data structure, defined recursively as follows. A 
scenario is either an ordered list of scenarios, a set of 
alternative scenarios, a set of pairwise commuting scenarios, 
or a single atomic rearrangement operation. For 
a rearrangement scenario from $\pi$ to $\sigma$ it holds that each 
linear list of rearrangements generated from a traversal 
of the rearrangement tree transforms $\pi$ into $\sigma$, where the 
(sub)scenarios of an ordered list are applied in the given 
order, the (sub)scenarios of a commuting scenario are 
applied in any order, and only one of the alternative 
(sub)scenarios is applied. 
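
The recursive definition can be illustrated with a small sketch that enumerates the linear lists encoded by such a tree. The class names and operation labels below are hypothetical and do not reflect the actual CREx implementation; commuting sub-scenarios are assumed to be permuted as whole blocks, without interleaving.

```python
from dataclasses import dataclass
from itertools import permutations, product


@dataclass(frozen=True)
class Atom:            # a single atomic rearrangement operation
    name: str


@dataclass(frozen=True)
class Ordered:         # sub-scenarios applied in the given order
    parts: tuple


@dataclass(frozen=True)
class Alternatives:    # exactly one of the sub-scenarios is applied
    options: tuple


@dataclass(frozen=True)
class Commuting:       # sub-scenarios may be applied in any order
    parts: tuple


def linearisations(scenario):
    """Yield every linear list of operation names encoded by the tree."""
    if isinstance(scenario, Atom):
        yield (scenario.name,)
    elif isinstance(scenario, Alternatives):
        for option in scenario.options:
            yield from linearisations(option)
    elif isinstance(scenario, Ordered):
        choices = [list(linearisations(p)) for p in scenario.parts]
        for combo in product(*choices):
            yield tuple(op for part in combo for op in part)
    elif isinstance(scenario, Commuting):
        for order in permutations(scenario.parts):
            yield from linearisations(Ordered(tuple(order)))
    else:
        raise TypeError(f"unknown scenario node: {scenario!r}")


# Hypothetical scenario: T1, then two commuting inversions, then either
# one transposition or three inversions as an alternative.
example = Ordered((
    Atom("T1"),
    Commuting((Atom("I1"), Atom("I2"))),
    Alternatives((Atom("T2"), Ordered((Atom("I3"), Atom("I4"), Atom("I5"))))),
))
lists = list(linearisations(example))
```

For this example the traversal yields four linear lists: two orders of the commuting inversions times two alternatives.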

For the permutation above CREx first matches the 
transposition pattern on the root node of the SIT and 
reports the transposition of the sets {1,..., 5} and {6,..., 
10}. Next CREx adds the inverse transposition of the 
elements {2, 3} to the other side of {4, 5} to the rearrangement 
scenario since the corresponding pattern 
matches that for the node {2, 3, 4, 5}. The last pattern 
matching on a linear node is the inversion pattern of 
node {5}. Finally, the difference in the elements corresponding 
to the prime node adds a TDRL that duplicates 
the elements {7, 8, 9, 10} and keeps the elements {7, 8} 
(resp. {9, 10}) in the first (resp. second) copy. 
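
The effect of the TDRL in this example can be reproduced with a few lines of code. The sketch below is a hypothetical helper, not part of CREx. It assumes the scenario is read from the example permutation towards the identity, so the duplicated segment is (9 7 10 8) as it appears in the permutation.

```python
def tdrl(perm, start, end, keep_first):
    """Apply a tandem duplication random loss to perm[start:end].

    The segment is duplicated in tandem; genes whose unsigned value is in
    keep_first survive in the first copy, all other genes of the segment
    survive in the second copy. Signs are never changed by a TDRL.
    """
    segment = perm[start:end]
    first = [g for g in segment if abs(g) in keep_first]
    second = [g for g in segment if abs(g) not in keep_first]
    return perm[:start] + tuple(first + second) + perm[end:]


# Segment (9 7 10 8): keep {7, 8} in the first copy, {9, 10} in the second.
after = tdrl((6, 9, 7, 10, 8), 1, 5, {7, 8})

# A TDRL that keeps a suffix of the segment in the first copy has the
# same effect as a transposition of the two blocks.
as_transposition = tdrl((1, 2, 3, 4), 0, 4, {3, 4})
```

Under these assumptions the first call sorts the block to (6 7 8 9 10), and the second call shows the case in which a TDRL is indistinguishable from a transposition.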

Mitochondrial gene arrangement data set 

The data set used in this paper is based on the 1 701 
complete metazoan mitochondrial genomes in NCBI 
RefSeq [22], release 36 (July 2009). Since there exist several 
misannotations in the NCBI RefSeq data, the tRNA 
annotation has been postprocessed with ARWEN [23] 
and tRNAscan-SE [24] (for more details see [25]). The 
gene arrangement data set consists of 185 unique gene 
orders having the standard set of 37 genes common to 
most metazoan mitochondrial genomes [1] representing 
1 458 gene orders in total. The relatively small number 
of unique gene orders is mainly due to the fact that the 
gene order is the same for most Chordata species with a 
known gene order. But also in some other phyla several 
species have the same gene order (e.g., 
birds and some deep sea fishes). 

Simulated gene arrangement data sets 

Each simulated rearrangement scenario is constructed 
by applying $r \in [1:10]$ random rearrangements starting 
at the identity permutation of length $n = 100$. A given 
probability vector $p = (p_I, p_T, p_{iT}, p_{TDRL})$ specifies the 
probabilities that a rearrangement is an inversion ($p_I$), a 
transposition ($p_T$), an inverse transposition ($p_{iT}$), or a 
TDRL ($p_{TDRL}$). Random rearrangements, i.e., inversions, 
transpositions, or inverse transpositions, are chosen with 
equal probability from the set of all possible respective 
operations. A random TDRL is generated by choosing 
uniformly at random the duplicated interval and, for 
each element, whether it is deleted in the first or second copy. 
We have considered six rearrangement models: (I) 
inversions only, (T) transpositions only, (iT) inverse 
transpositions only, (TDRL) TDRLs only, (IT) inversions 
and transpositions both with the same probability, i.e., 
$p = (0.5, 0.5, 0, 0)$, and (All) all four types of rearrangement 
operations with $p = (0.3, 0.3, 0.3, 0.1)$. Furthermore, 
data sets have been constructed where each rearrangement 
operation affects the order of at most $w \in \{10, 20, \ldots, 100\}$ 
genes. For each combination of rearrangement 
model, $r$, and $w$, 1000 data sets have been simulated, i.e., 
altogether 600 000. 
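
A simulation of this kind could look as follows. This is a minimal sketch under stated assumptions, not the generator used for the study: all function names are hypothetical, cut points are drawn uniformly with rejection to respect the width limit w, and an inverse transposition inverts either of the two swapped blocks with equal probability.

```python
import random


def cuts(n, count, w, rng):
    """Draw `count` distinct sorted cut points in [0, n] spanning at most w genes."""
    while True:
        c = sorted(rng.sample(range(n + 1), count))
        if c[-1] - c[0] <= w:
            return c


def inversion(p, w, rng):
    i, j = cuts(len(p), 2, w, rng)
    return p[:i] + [-g for g in reversed(p[i:j])] + p[j:]


def transposition(p, w, rng):
    i, j, k = cuts(len(p), 3, w, rng)
    return p[:i] + p[j:k] + p[i:j] + p[k:]


def inverse_transposition(p, w, rng):
    i, j, k = cuts(len(p), 3, w, rng)
    left, right = p[i:j], p[j:k]
    if rng.random() < 0.5:   # assumption: either block may be the inverted one
        return p[:i] + right + [-g for g in reversed(left)] + p[k:]
    return p[:i] + [-g for g in reversed(right)] + left + p[k:]


def tdrl(p, w, rng):
    i, j = cuts(len(p), 2, w, rng)
    # For each gene decide independently which copy retains it.
    in_first = [rng.random() < 0.5 for _ in range(i, j)]
    first = [g for g, keep in zip(p[i:j], in_first) if keep]
    second = [g for g, keep in zip(p[i:j], in_first) if not keep]
    return p[:i] + first + second + p[j:]


OPERATIONS = [inversion, transposition, inverse_transposition, tdrl]


def simulate(n, r, probs, w, seed=None):
    """Apply r random rearrangements to the identity of length n.

    probs = (p_I, p_T, p_iT, p_TDRL); returns the final signed
    permutation and the list of operation types that were applied.
    """
    rng = random.Random(seed)
    perm = list(range(1, n + 1))
    scenario = []
    for _ in range(r):
        op = rng.choices(OPERATIONS, weights=probs)[0]
        perm = op(perm, w, rng)
        scenario.append(op.__name__)
    return perm, scenario
```

Note that a randomly drawn TDRL may leave the gene order unchanged; whether such cases are redrawn is not specified here and is left open in the sketch.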

Each pair of the identity permutation with one of the 
generated permutations has been used as input for 
CREx. Let $S$ be the set of rearrangement operations in 
the simulated scenario and $C$ the rearrangement scenario 
computed by CREx. The quality of the CREx scenario 
is measured by precision $= |S \cap C| / |C|$ and 
recall $= |S \cap C| / |S|$ for $|S| \neq 0 \wedge |C| \neq 0$ (if $|S| = |C| = 0$ 
precision and recall are 1; if $|S| = 0 \wedge |C| \neq 0$ or 
$|S| \neq 0 \wedge |C| = 0$ precision and recall are 0). The "intersection" 
operation between $S$ and $C$ is computed with a 
corresponding function provided by TreeREx [20] which 
determines the longest common suffix between S and C 
or more precisely between any pair of the linear lists of 
rearrangements generated from a traversal of the tree 
representations of S and C. 
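
The two measures, including the special cases for empty scenarios, can be sketched as follows. This is a simplified illustration, not the TreeREx function: both scenarios are assumed to be given as single linear lists of hypothetical operation labels, whereas the actual computation considers all pairs of linear lists derived from the scenario trees.

```python
def common_suffix_length(s, c):
    """Length of the longest common suffix of two lists of operations."""
    n = 0
    while n < min(len(s), len(c)) and s[-1 - n] == c[-1 - n]:
        n += 1
    return n


def precision_recall(simulated, reconstructed):
    """Precision and recall of a reconstructed scenario.

    |S n C| is taken to be the length of the longest common suffix of
    the two linear lists.
    """
    if not simulated and not reconstructed:
        return 1.0, 1.0
    if not simulated or not reconstructed:
        return 0.0, 0.0
    shared = common_suffix_length(simulated, reconstructed)
    return shared / len(reconstructed), shared / len(simulated)


# Hypothetical labels: only the last operation is shared.
p, r = precision_recall(["I1", "T2", "T3"], ["I9", "T3"])
```

Here one of the two reconstructed operations is correct and one of the three simulated operations is recovered.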

Building the rearrangement inventory graph 

Let $\Pi = \{\pi_1, \ldots, \pi_k\}$ be a set of gene orders. A directed 
graph $G = (\Pi, E_l \cup E_p)$, called the rearrangement inventory 
graph (RI-graph), is defined on the set of nodes $\Pi$ 
where the edges represent rearrangement scenarios reconstructed 
by CREx. Recall that the aim is to represent only 
those rearrangement scenarios that can be considered as 
likely candidates for rearrangements that might have 
happened during evolution. Therefore, edge set $E_l$ contains 
only edges between two permutations $\pi_i$, $\pi_j$ for which the 
corresponding SIT has only linear nodes. Furthermore, for 
at least one of the two permutations, e.g., $\pi_i$, it must hold 
that there does not exist a third permutation in $\Pi$ which 
has a smaller distance to $\pi_i$ than the distance between $\pi_i$ 
and $\pi_j$. The distance $d$ between two permutations is 
defined as the length of the corresponding CREx scenario 
(in case of alternative scenarios the shortest alternative is 
regarded). More exactly, two nodes $\pi_i, \pi_j \in \Pi$ with 
$1 \le i \neq j \le k$ are connected by an edge in $E_l$ if i) the SIT for 
$\{\pi_i, \pi_j\}$ is linear, and ii) either there is no gene order 
$\pi_h \in \Pi$ with $i \neq h \neq j$ such that the SIT for $\{\pi_h, \pi_j\}$ is 
linear and $d(\pi_i, \pi_h) < d(\pi_i, \pi_j)$, or the analogous statement 
holds for $j$ instead of $i$. Observe that the scenarios corresponding to 
edges in $E_l$ do not contain any TDRL operation. Hence, 
edge set $E_p$ is introduced to consider likely rearrangement 
scenarios containing a TDRL operation. Since TDRL 
operations are considered as rare, scenarios that have 
more than one TDRL operation are not considered for 
introducing an edge to $E_p$. Formally, there exists an edge 
$(\pi_i, \pi_j) \in E_p$ if i) the rearrangement scenario from $\pi_i$ to $\pi_j$ 
includes exactly one TDRL, ii) $\pi_i$ and $\pi_j$ are in different 
connected components of the graph $(\Pi, E_l)$, and iii) 
$d(\pi_i, \pi_j) = \min_{h \in [1:k]} \{d(\pi_h, \pi_j)\}$. The second condition states 
that a scenario with TDRL operations is not considered 
when the corresponding permutations are connected in 
$(\Pi, E_l)$, i.e., they are already related to each other by a 
likely scenario not requiring TDRL operations. 
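
The construction can be sketched in code once the pairwise CREx results are available. The function below is a hypothetical illustration, not the authors' implementation. It assumes precomputed matrices for the scenario length, for SIT linearity, and for the number of TDRLs per scenario. It reads condition ii) as "no third gene order has a linear SIT with the endpoint and is closer to it", and it excludes h = j from the minimum in condition iii).

```python
def build_ri_graph(k, d, linear, tdrl_count):
    """Return the edge sets (E_l, E_p) of an RI-graph on nodes 0..k-1.

    d[i][j]          length of the CREx scenario between orders i and j
    linear[i][j]     True if the SIT of orders i and j has only linear nodes
    tdrl_count[i][j] number of TDRLs in the scenario from order i to order j
    """
    def has_closer_linear(i, j):
        # Assumed reading: some third order h is linear with i and closer to i.
        return any(linear[i][h] and d[i][h] < d[i][j]
                   for h in range(k) if h not in (i, j))

    e_l = set()
    for i in range(k):
        for j in range(i + 1, k):
            # The edge is kept if at least one endpoint has no closer competitor.
            if linear[i][j] and not (has_closer_linear(i, j)
                                     and has_closer_linear(j, i)):
                e_l.add((i, j))

    # Connected components of (Pi, E_l) via a small union-find.
    parent = list(range(k))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in e_l:
        parent[find(i)] = find(j)

    e_p = set()
    for j in range(k):
        best = min(d[h][j] for h in range(k) if h != j)
        for i in range(k):
            if (i != j and tdrl_count[i][j] == 1
                    and find(i) != find(j) and d[i][j] == best):
                e_p.add((i, j))   # directed edge from i to j
    return e_l, e_p
```

Edges in the first set are undirected pairs, while edges in the second set are directed because a TDRL is not reversible by a single TDRL in general.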

Representation of the rearrangement inventory graph 

The nodes of the RI-Graph are shown as rectangles 
labelled with the accession number of one representative 
species and the number of species with the same gene 
order in parentheses if larger than one. Edges are 
coloured with respect to the represented rearrangement 
scenario such that the fractions of TDRLs, inversions, 
and transpositions correspond, respectively, to the inten- 
sity of the colours red, green, and blue. For the purpose 
of colouring an inverse transposition is counted as an 
inversion plus a transposition. The direction of a scenario 
that contains a TDRL is indicated by a directed edge. 
Each edge is labelled with the corresponding unique 
identifiers of the rearrangements that are predicted by 
CREx. An index of the identifiers of all predicted rearran- 
gements is given in Additional file 1. The layout of the 
graph was done manually starting from an initial layout 
computed with Graphviz (http://www.graphviz.org). 
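
The colouring rule can be made concrete with a short sketch. The exact mapping from fractions to colour intensities is not specified in the text, so the linear scaling to 8-bit RGB values and the function name below are assumptions.

```python
def edge_colour(n_tdrl, n_inv, n_trans, n_inv_trans):
    """Hex colour for an edge, given the operation counts of its scenario.

    An inverse transposition is counted as an inversion plus a
    transposition. Red, green and blue encode the fractions of TDRLs,
    inversions and transpositions, respectively.
    """
    inversions = n_inv + n_inv_trans
    transpositions = n_trans + n_inv_trans
    total = n_tdrl + inversions + transpositions
    if total == 0:
        raise ValueError("scenario without rearrangements has no colour")
    red, green, blue = (round(255 * x / total)
                        for x in (n_tdrl, inversions, transpositions))
    return f"#{red:02x}{green:02x}{blue:02x}"
```

A string of this form can be used as the value of the `color` attribute of an edge in a Graphviz input file.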

Results and discussion 

Empirical analysis of CREx reconstructions 

An empirical analysis of the quality of the reconstructions 
of CREx on the simulated data is presented in the follow- 
ing for the different models of genome rearrangement. 



Clearly, it cannot be expected that CREx is able to 
reconstruct the simulated scenarios for a large number r 
of rearrangement operations. The reason is that there 
exist too many possible rearrangements and also the 
shortest possible rearrangement scenario between two 
simulated permutations is not necessarily the simulated 
one. The hope is, that for a small number of rearrange- 
ment operations CREx can deliver good reconstructions. 
In that case such reconstructions might be useful as a 
basis for the analysis of the phylogeny even for a large 
number species (as done in the second part of the results 
section) because CREx is also very fast. For the 600 000 
reconstructions CREx needed 21 min 54 s on a laptop 
with a 2.0 GHz processor, i.e., one reconstruction in « 
10 -3 s on average. 
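
As a quick check of the stated order of magnitude, the average time per reconstruction follows directly from the two reported numbers:

```latex
\[
  \frac{21 \cdot 60\,\mathrm{s} + 54\,\mathrm{s}}{600\,000}
  = \frac{1314\,\mathrm{s}}{600\,000}
  \approx 2.2 \times 10^{-3}\,\mathrm{s}
\]
```

That is roughly two milliseconds per pairwise comparison.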
Reconstruction quality 

Boxplots of the precision of the CREx reconstructions 
for different numbers of rearrangement operations r are 
shown in Figure 2 (left). The corresponding plots for 
the recall values are omitted (see Additional File 1), 
because they are very similar to the corresponding pre- 
cision values, i.e., the average precision and recall values 
differ by < 0.05. 

It can be seen that for r = 1 the correct scenario was 
always found by CREx. For r = 2 and I (resp. TDRL) 
the majority of the rearrangement scenarios, i.e., 686 
(resp. 710), has been reconstructed correctly. For the 
rearrangement models IT and All the correct scenario 
has been reconstructed for considerably more than one 
third of the data sets (425 and 435). A correct scenario 
has been reconstructed for less than one third (306 resp. 
211) of the data sets for T and iT. For a large number 
of rearrangements (r > 9 and I, r > 8 and IT, r > 5 and 
TDRL, All, r > 4 and T, r > 3 and iT) a reconstruction 
with a recall of one could not be done for a single data 
set. But often at least a part of the rearrangement sce- 
nario could still be reconstructed, e.g., for at least 678 of 
the data sets generated with I for r < 10 and the major- 
ity of the IT (T, iT, TDRL) data sets for r < 6 (r < 4, r < 
2, r < 3). 

Thus, for small values of r the quality of the recon- 
structed scenarios is good, but decreases for larger 
values of r. CREx is able to reconstruct at least a part of 
the simulated rearrangement scenario for many data 
sets with medium values of r. This is an interesting 
observation, because a partially correct reconstruction 
might suffice for the correct operation of TreeREx [20] 
which is based on determining common suffixes in rear- 
rangement scenarios leading to a permutation which has 
been determined with CREx. The results are apparently 
better when no transpositions or inverse transpositions 
are applied. Although these first results for simulated 
data seem not to be very promising, additional biologically 
relevant criteria and constraints to the 
rearrangement models can be identified where CREx' 
performance is better. 

Figure 2 Reconstruction Quality of CREx. Precision of CREx reconstructions for the simulated data sets for the different rearrangement 
models; left: precision for different rearrangement numbers $r \in [1:10]$; right: precision for different numbers of prime nodes p of the SIT. 

Sensitivity to strong interval tree structure 

In this subsection it is shown that the reconstruction 
quality depends to a large extent on the properties of 
the SIT corresponding to the pair of permutations ana- 
lysed by CREx. Figure 2 (right) shows the precision for 
the simulated data sets in relation to the number of 
prime nodes in the corresponding SITs. Plots for recall 
are given in Additional File 1 (difference of the average 
values < 0.06). 

The results for the TDRL model differ from the 
results of the other rearrangement models in an increas- 
ing reconstruction quality for larger values of p. This is 
discussed first. The cases with p = 0 generated with 
TDRLs correspond to the rare cases (155 times for r 
= 1 and 19 times for k = 2) where the random TDRL 
has the same effect as a transposition. If the random 
TDRLs do not overlap they create separate prime nodes, 
i.e., k = p (the converse does not necessarily hold). Such 
cases can easily be reconstructed by CREx. This is confirmed 
by the observation that for p = k the precision 
and recall values have been 1 in all but two cases for k 
= 2. Furthermore, the fraction of data sets with k = p 
increases with p, from 0.11 for p = 1 to 0.6 for p = 4. 
This explains the difference of the results for large p for 
the TDRL model. The reconstructed scenarios of CREx 
are mostly correct when the SIT has no prime node. 
This holds for all rearrangement models. But, when the 
SIT has a prime node, a large fraction of the recon- 
structed rearrangements is not correct. 

There are 10 833 random data sets with a prime node free 
SIT. For 8 101 (≈ 75%) of these data sets the CREx 
reconstruction is correct and it is at least partially correct 
for 9 616 (≈ 89%). For the remaining 49 167 data sets, 
which have a prime node in the SIT, the CREx reconstruction 
is correct only 2 128 (≈ 4%) times and at least 
a part of the reconstructed scenario was correct for only 
17 741 (≈ 36%) of these data sets. For r = 1 the SIT is, 
except for TDRL, always prime node free. But also for 
the data sets with r > 1 the absence of prime nodes in 
the SIT is still a good indicator for the quality of CREx' 
reconstructions, i.e., ≈ 53% correct (resp. ≈ 79% partially 
correct) reconstructions for data sets with prime 
node free SIT compared to only ≈ 7% (resp. ≈ 35%) 
when the SIT has prime nodes. 

The presented results clearly show that the absence of 
prime nodes in the SIT is a good indicator for the qual- 
ity of the rearrangement scenario reconstructed by 
CREx. This is remarkable because pairwise comparisons 
of metazoan mitochondrial gene arrangements often 
correspond to prime node free SITs and most of the 
gene orders take part in at least one such comparison 
(see Additional File 1). 
Different rearrangement sizes 

The probability of the rearrangements may depend on 
additional properties in real world scenarios, e.g., short 
rearrangement operations are found more often [26] 
(and references therein). In the following the influence 
of the length (measured as the number w of influenced 
genes) of the rearrangement operations on the recon- 
struction quality of CREx is analysed (Figure 3). It can 
be seen that the reconstructions of CREx are of much 
better quality for smaller values of w. When comparing 
the unrestricted case (w = 100) and the case w = 10 an 
improvement of the average precision and recall of at 
least 0.44 was measured for all values r > 2. Of course 
the structure of the SIT depends on the applied 
rearrangements. For short rearrangement operations the 
number of simulated data sets without prime nodes 
increases, e.g., 53 819 data sets for w > 50 and 88 558 
for w < 50 are prime node free. Thus, the effects of 
increased reconstruction quality in the case of prime 
node absence and shorter rearrangement operations are 
not independent. See [27] for a formal study of the relation 
of rearrangement length and the properties of the 
generated gene clusters. The presented results indicate 
that the quality of CREx' reconstructions is improved 
for gene orders that evolved with short rearrangements. 
Clearly, since there is no accepted model for mitochondrial 
gene order evolution this does not necessarily 
imply an improved reconstruction quality of CREx for 
mitochondrial gene orders. 

Figure 3 Reconstruction Quality of CREx for Different Rearrangement Sizes. Average precision of CREx for simulated data sets for different 
numbers of rearrangements r and different numbers of affected elements w; averages are computed over the results of all five rearrangement 
models for each combination of r and w. 

Inventory of metazoan mitochondrial gene order 
rearrangements 

The RI-Graph has been computed for the set of all 185 
unique complete Metazoan gene orders. The computation 
needed 30 s on a laptop with a 2.0 GHz CPU. The 
185 nodes of the resulting RI-Graph are organised in 
several connected components. Most of the connected 
components are small: 29 nodes are singletons, nine 
components contain two nodes, six components contain 
three nodes, there exists one component of size five, 
and one of size eight. Additionally, there are two large 
components containing together more than half of the 
nodes. One of these components has 45 nodes and 
represents, with the exception of one Priapulid, gene 
orders from arthropod species. The other large component 
contains 62 nodes which correspond to gene 
orders of Chordata plus one Xenoturbellida and one 
Hemichordata. 

In the following three large connected components of 
the RI-Graph are analysed in more detail. A more com- 
prehensive analysis of the results is given in [25]. Note, 
that the study presented in the following is not intended 
to be phylogenetically conclusive. 
Mollusca 

The mollusc gene orders are organised in five connected 
components of size greater than one and three singleton 
nodes (S. lobatum, G. eborea, and P. dolabrata). 

Scenarios for five gastropod gene orders are given in 
the connected component shown in Figure 4 (left). I19 
and I20 are presumably caused by tRNA annotation 
errors since ARWEN and tRNAscan already report five 
differently oriented tRNAs for NC_01022. Hence, we 
exclude B. tenagophila from the following discussions. 
The transpositions T100 and T102-104 are given in 
[28]. Transposition T101 suggests a previously unreported 
alternative scenario. Assuming the tree topology 
given in [28] a parsimony analysis of the presented 
results suggests that the gene arrangement of C. nemoralis 
cannot be ancestral. In addition to the three other 
gene orders the gene arrangement separated by T100 
from C. nemoralis, by T101 from B. glabrata, and by 
T102 from A. coerulea is a putative ancestral gene 
order. But note that every unrooted tree topology 
including the four species is equally parsimonious. 

The largest connected component of mollusc gene 
orders (right part of Figure 4) contains gene orders from 
Gastropoda, Cephalopoda, and one Chitonid. Transpositions 
T16-19 and inversion I4 are reconstructed as 
reported in [29]; inverse transposition iT4 is given there 
as two separate inversions. Note that the CREx scenarios 
between the gene orders of O. vulgaris, H. rubra, and I. 
obsoleta have many common rearrangements. This suggests 
the gene order separated by iT4 from I. obsoleta, 
by T18 from O. vulgaris, and by T16, T17 from H. 
rubra as the ancestral gene arrangement for at least two 
of the three gene arrangements (most likely of the two 
Gastropoda). Assuming any of the three gene orders as 
ancestral would not be parsimonious. TDRL7 is presented 
in [6]. The scenario to N. macromphalus is proposed 
as "transposition of two large blocks and 
transposition of F" instead of TDRL8, but transposition 
T20 is reported equivalently in [30]. 

Figure 4 Large Mollusca Components. The two larger connected components including mitochondrial gene orders from Mollusca; left) five 
Gastropoda; right) four Gastropoda, one Chitonid, and three Cephalopoda 

Arthropoda 

The gene orders of the Arthropoda are clustered in nine 
connected components. The component containing 45 of 
the 77 unique arthropod gene orders is shown in Figure 5. 
The subgraph defined by the nodes NC_002010, 
NC_000844, and NC_002355 and nodes which are only 
adjacent to one of these three nodes (plus one chain of 
nodes - NC_011243, NC_007010, NC_006081 - con- 
nected to NC_000844) is given in the upper part of the fig- 
ure. The remaining part of the connected component is 
presented in the lower part. The component contains two 
unique gene orders representing Hexapoda and Crustacea 
(Pancrustacea). That is, NC_000844 represents 90 Hexapod 
and 14 Crustacean gene orders and node NC_012421 
represents gene orders of two Hexapoda and one Crustacea. 
The gene order corresponding to node NC_000844 is 
considered to be the ancestral Pancrustacean gene order 
[31] and the gene order of L. polyphemus (NC_002010) is 
regarded as the ancestral arthropod gene order [32]. 



Several of the rearrangements that have been found occur 
between different pairs of permutations. 
Therefore, some of these rearrangements can be found 
several times in the literature. These are for example i) 
T26 is a swap of I and Q, e.g., [33,34], ii) iT21 is an 
inverse transposition of L2 [1,35], iii) T58 is a transposition 
of M and the pair I, Q [36], and iv) T2 is a swap of P 
and T, e.g., [8,33]. These rearrangements are of great 
interest because they have been found several times 
between or within different taxonomic groups and may 
be examples of convergent or synapomorphic rearrange- 
ments, as discussed in [37]. The results presented here 
will be of great help to investigate such problems. 

First, some rearrangements shown in the lower part of 
Figure 5 are analysed. Using the presented method 
CREx automatically found TDRLs 16, 18, 20, and 21 
which are striking examples of the tandem duplication 
random loss model of genome rearrangement. TDRLs 
20 and 21 have been presented in [8,9] differing only by 
the exclusion of a single tRNA from the TDRLs, favouring 
an additional transposition of the tRNA in both 
cases. This is because the control region is used in these 
studies as additional evidence. It is interesting that 
seven (resp. five) transpositions would be necessary for an 
alternative explanation of TDRL20 (resp. TDRL21). 
Three TDRLs are needed for the rearrangements in the 
opposite direction in both cases. Note that [38] postu- 
lated "at least 10 translocations" for the rearrangements 
leading to C. coleoptrata. In [39] tandem duplication 
random loss was discussed as a possible cause of the 
rearrangement in the undescribed Lepidopsocrid species, 
but the actual TDRL (18) is given here for the first time. 
In [40] the gene orders of C. destructor and D. pulex 
have been compared and different positions for eleven 
genes have been noticed where for two genes inversion 
is involved. "For nine of the translocations, the 'duplication/random 
loss' mechanism" was suggested to be 
plausible and "a minimum of five independent duplication/random 
loss events" have been postulated. The 
automatic reconstruction using CREx (iT22, iT23, 
TDRL16) matches the description perfectly. But 
potential misannotations of the V and the rRNAs in S. 
arsenjevi, S. mantis, and M. japonicus make further 
investigations necessary. Instead of TDRL19 two transpositions 
are proposed in [3], but iT28 is given equivalently. 
In [10] the gene orders of L. polyphemus and N. 
annularus have been compared and a tandem duplication 
non-random loss rearrangement and a transposition 
have been proposed. That is, in each copy only genes on 
the same strand are lost (with one exception). CREx 
reconstructs the same rearrangements (T141 and 
TDRL22) but with an intermediate step via the gene 
order of Nothopuga sp. by transposition T136 [9]. T144 
and iT29 are reconstructed as in [5], where it was 
speculated that the transposition is derived from the 
pre-"non random loss" tandem duplicated gene arrangement 
that gave rise to the N. annularus gene order. 

Figure 5 Large Arthropoda Component. Large connected component of arthropod mitochondrial gene orders; top: nodes which are only 
adjacent to nodes NC_002010, NC_000844, and NC_002355 plus chain of 4 nodes adjacent to NC_000844; bottom: rest of the component. 

Rearrangement counts 

Table 1 shows the numbers of rearrangement operations
found in the analysed connected components of the
RI-Graph, i.e., without the components containing
Deuterostomia. Note that the total number of unique
rearrangements can be less than the sum of the
corresponding numbers of known, different, and new
rearrangement operations, because the same
rearrangement can occur several times in the RI-Graph
and might be counted differently for different pairs of
gene orders. The strength of the proposed RI-Graph
approach is shown by the fact that most of the proposed
rearrangements are likely to be correct in the sense
that they are in agreement with the literature. Within
the analysed components a high number of
transpositions has been found. This is in agreement
with the results presented in [4]. In addition, [37]
manually identified 43 transpositions within the
presented 67 comparisons of mitochondrial gene orders
of Hymenoptera. A high number of transpositions was
also reported for two bacterial genomes in [41].
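The counting rule described above can be checked directly against the figures in Table 1. The short Python sketch below is only a consistency check on the published numbers; the dictionary layout and variable names are assumptions made for illustration. Each entry is a pair (occurrences, unique rearrangements).

```python
# Table 1 as (occurrences, unique) pairs per category.
table1 = {
    "I":    {"total": (44, 26),  "known": (13, 9),  "diff": (14, 8),  "new": (17, 15)},
    "T":    {"total": (109, 71), "known": (70, 42), "diff": (14, 11), "new": (25, 24)},
    "iT":   {"total": (18, 14),  "known": (10, 8),  "diff": (4, 2),   "new": (4, 4)},
    "TDRL": {"total": (16, 16),  "known": (7, 7),   "diff": (3, 3),   "new": (6, 6)},
}
categories = ("known", "diff", "new")

for kind, row in table1.items():
    occurrence_sum = sum(row[c][0] for c in categories)
    unique_sum = sum(row[c][1] for c in categories)
    # Occurrences add up exactly to the total.
    assert occurrence_sum == row["total"][0]
    # Unique counts may exceed the unique total, because one rearrangement
    # can be classified differently for different pairs of gene orders.
    assert unique_sum >= row["total"][1]
    print(kind, occurrence_sum, unique_sum, row["total"][1])

grand_total = sum(row["total"][0] for row in table1.values())
transposition_share = table1["T"]["total"][0] / grand_total
print(f"transpositions: {transposition_share:.1%} of {grand_total} operations")
```

For inversions and transpositions the category-wise unique counts sum to more than the unique total (32 versus 26, and 77 versus 71), which is the effect noted in the text. For inverse transpositions and TDRLs the two values coincide.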

We would like to stress that these findings are
apparently in disagreement with weighting schemes
which put more weight on transpositions, e.g., [13,42].
Such weighting is used to avoid a bias favouring
transpositions, since they are argued to be "observed
much less frequently than inversions in many
evolutionary contexts" [42]; see also [43]. The results
presented here indicate that it is necessary to
re-examine, at least for intra-phylum comparisons, the
weighting schemes that are typically used in the
literature for the analysis of mitochondrial gene order
rearrangements.

Table 1 Rearrangement Type Counts

        #           known      diff       new
I       44 (26)     13 (9)     14 (8)     17 (15)
T       109 (71)    70 (42)    14 (11)    25 (24)
iT      18 (14)     10 (8)     4 (2)      4 (4)
TDRL    16 (16)     7 (7)      3 (3)      6 (6)

Number of rearrangement operations found in the analysed connected
components of the RI-Graph; I: inversions; T: transpositions; iT: inverse
transpositions; TDRL: tandem duplication random loss operations; #: total
number of rearrangements in the graph; known: rearrangements in agreement
with the literature; diff: rearrangements reported differently in the
literature or which are caused by annotation errors; new: rearrangements
that have not been found in the literature; in parentheses: number of
unique rearrangements

Conclusions 

The quality of the mitochondrial gene order
rearrangement scenarios reconstructed with CREx has
been analysed in an extensive study on artificial gene
orders that have been simulated with different
rearrangement models. The four types of rearrangement
operations that are relevant for mitochondrial gene
order evolution have been considered: inversion,
transposition, inverse transposition, and tandem
duplication random loss (TDRL). It was shown that the
absence of prime nodes in the strong interval tree (SIT)
and a not too large number of genes involved in the
rearrangement operations are good criteria indicating
reliable CREx reconstructions.
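Of the four operation types, TDRL is the least familiar, and it is not symmetric: undoing a TDRL can require more than one TDRL. The Python sketch below illustrates this under simplifying assumptions. The whole gene order is duplicated rather than a segment, strands are ignored, and the function names are hypothetical rather than part of CREx.

```python
from itertools import combinations


def tdrl(order, keep_first):
    """Tandem duplication random loss on the whole gene order.

    The order is duplicated in tandem; genes in keep_first survive in the
    first copy, all other genes survive in the second copy.
    """
    first_copy = [g for g in order if g in keep_first]
    second_copy = [g for g in order if g not in keep_first]
    return first_copy + second_copy


def reachable_by_one_tdrl(source, target):
    """Brute-force test over all loss patterns (feasible for small orders only)."""
    for size in range(len(source) + 1):
        for subset in combinations(source, size):
            if tdrl(source, set(subset)) == list(target):
                return True
    return False


source = list("abcdef")
result = tdrl(source, {"a", "c", "e"})
print("".join(result))

# One TDRL suffices in the forward direction but not in the reverse one.
print(reachable_by_one_tdrl(source, result))
print(reachable_by_one_tdrl(result, source))
```

The forward step interleaves two ordered subsequences in a single operation, whereas no single loss pattern restores the original order, which mirrors the observation that the opposite direction of TDRL20 and TDRL21 needs several TDRLs.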

Based on the simulation results, a method has been
proposed that can be used for automatically analysing a
large number of mitochondrial gene orders in order to
find a likely subset of scenarios from pairwise gene
order comparisons. The scenarios found are stored in
an RI-Graph data structure. The proposed method was
applied to all known metazoan mitochondrial gene
orders, resulting in an RI-Graph where most gene orders
within the same phylum are connected. The potential of
the new method was shown by the substantial agreement
with the literature on gene order rearrangements within
the Protostomia.
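The general idea of collecting pairwise scenarios in a graph and then inspecting its connected components can be sketched as follows. This is a toy illustration and not the RI-Graph construction itself: the pairwise comparison is replaced by a stand-in that only detects a single inversion, edges are treated as undirected, and all names and gene orders are hypothetical.

```python
from itertools import combinations


def single_inversion(a, b):
    """Stand-in for a pairwise comparison: return (i, j) if reversing
    a[i:j] and flipping its signs yields b, otherwise None."""
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if a[:i] + [-g for g in reversed(a[i:j])] + a[j:] == b:
                return (i, j)
    return None


def build_graph(orders, compare):
    """Nodes are gene order names; an edge stores the scenario found."""
    graph = {name: {} for name in orders}
    for u, v in combinations(orders, 2):
        scenario = compare(orders[u], orders[v])
        if scenario is not None:
            graph[u][v] = scenario
            graph[v][u] = scenario
    return graph


def connected_components(graph):
    """Depth-first search over the undirected graph."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(n for n in graph[node] if n not in component)
        seen |= component
        components.append(component)
    return components


# Hypothetical signed gene orders.
orders = {
    "A": [1, 2, 3, 4],
    "B": [1, -3, -2, 4],
    "C": [1, -3, -2, -4],
    "D": [2, 1, 4, 3],
}
graph = build_graph(orders, single_inversion)
components = connected_components(graph)
print(components)
```

Here A, B, and C end up in one component although A and C are not directly connected, which illustrates how intermediate gene orders can link more distant ones. With operations that are not symmetric, such as TDRL, a directed graph would be the more natural choice.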

Mitochondrial gene arrangement data are mostly still
analysed manually by biologists [37]. Such manual
analyses are valuable and indispensable, but handling
the huge amount of available data is tedious at best and
may even be infeasible. Therefore, usually only a very
small number of gene arrangements is compared (with
notable exceptions [9,37]), only a subset of the genes is
evaluated (usually tRNAs are excluded), or only a part
of the arrangements is compared (usually all species are
compared to a putative ancestral gene order [37], or
based on a phylogenetic tree [9]). As a consequence,
the phylogenetic signal which is or is not contained in
gene order data cannot be properly analysed, or
important alternative rearrangement scenarios may be
missed. The new method facilitates the automatic
analysis of gene orders and the reconstruction of
rearrangement scenarios. The results for the complete
mitochondrial data set can be computed in less than a
minute. It was shown here that CREx allows for a
comprehensive and efficient analysis of the
rearrangements within the connected components,
solving some of the problems mentioned above.
Furthermore, the possibility of identifying possible
ancestral states and cases of convergent evolution has
been indicated. For future work it is promising to
refine and extend the proposed method, e.g., by
including methods for handling missing genes or
duplicates.